
Accelerating Deep Neural Network Training with Inconsistent Stochastic Gradient Descent


Abstract

SGD is the widely adopted method to train CNNs. Conceptually it approximates the population with a randomly sampled batch; it then trains batches evenly by conducting a gradient update on every batch in an epoch. In this paper, we demonstrate that Sampling Bias, Intrinsic Image Difference and Fixed Cycle Pseudo Random Sampling differentiate batches in training, which in turn affects the learning speed on them. Because of this, the unbiased treatment of batches in SGD creates improper load balancing. To address this issue, we present Inconsistent Stochastic Gradient Descent (ISGD) to dynamically vary the training effort according to the learning status of each batch. Specifically, ISGD leverages techniques from Statistical Process Control to identify an undertrained batch. Once a batch is undertrained, ISGD solves a new subproblem, a chasing logic plus a conservative constraint, to accelerate training on that batch while avoiding drastic parameter changes. Extensive experiments on a variety of datasets demonstrate that ISGD converges faster than SGD. In training AlexNet, ISGD is 21.05% faster than SGD to reach 56% top-1 accuracy under exactly the same experimental setup. We also extend ISGD to work on multi-GPU or heterogeneous distributed systems based on data parallelism, making the batch size the key to scalability. We then study the relation of the ISGD batch size to the learning rate, parallelism, synchronization cost, system saturation and scalability. We conclude that the optimal ISGD batch size is machine dependent. Various experiments on a multi-GPU system validate our claim. In particular, ISGD trains AlexNet to 56.3% top-1 and 80.1% top-5 accuracy in 11.5 hours with 4 NVIDIA TITAN X GPUs at a batch size of 1536.
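As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below flags an undertrained batch when its loss exceeds a control limit of mean + k·std over recent batch losses, in the spirit of Statistical Process Control, and then spends a few extra gradient steps chasing the mean loss on that batch. The helpers `base_update` and `extra_update` are hypothetical callbacks that perform one SGD step and return the batch loss, and the step budget is a crude stand-in for the paper's conservative constraint on parameter change.

```python
import numpy as np

def isgd_epoch(model, batches, base_update, extra_update,
               control_k=3.0, max_extra_steps=5, warmup=10):
    """Minimal sketch of the ISGD control loop described in the abstract.

    base_update(model, batch)  -> loss after one ordinary SGD step (hypothetical helper)
    extra_update(model, batch) -> loss after one corrective SGD step (hypothetical helper)
    """
    losses = []
    for batch in batches:
        loss = base_update(model, batch)              # one ordinary SGD step
        losses.append(loss)
        mean, std = float(np.mean(losses)), float(np.std(losses))
        # SPC-style test: a loss above the upper control limit marks the
        # batch as undertrained.
        if len(losses) > warmup and loss > mean + control_k * std:
            # Accelerate training on this batch, but cap the number of
            # corrective steps to avoid drastic parameter changes.
            for _ in range(max_extra_steps):
                loss = extra_update(model, batch)
                if loss <= mean:                      # chasing target reached
                    break
    return model
```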
